The pink salmon genome: Uncovering the genomic consequences of a two-year life cycle

Pink salmon (Oncorhynchus gorbuscha) adults are the smallest of the five Pacific salmon native to the western Pacific Ocean. Pink salmon are also the most abundant of these species and account for a large proportion of the commercial value of the salmon fishery worldwide. A two-year life history of pink salmon generates temporally isolated populations that spawn either in even-years or odd-years. To uncover the influence of this genetic isolation, reference genome assemblies were generated for each year-class and whole genome re-sequencing data was collected from salmon of both year-classes. The salmon were sampled from six Canadian rivers and one Japanese river. At multiple centromeres we identified peaks of Fst between year-classes that were millions of base-pairs long. The largest Fst peak was also associated with a million base-pair chromosomal polymorphism found in the odd-year genome near a centromere. These Fst peaks may be the result of a centromere drive or a combination of reduced recombination and genetic drift, and they could influence speciation. Other regions of the genome influenced by odd-year and even-year temporal isolation and tentatively under selection were mostly associated with genes related to immune function, organ development/maintenance, and behaviour.


Introduction
Pink salmon are an economically important species under heavy exploitation and have been the subject of intense mitigation efforts to maintain current levels of exploitation. Commercial catches of pink salmon comprise roughly half of all Pacific salmon catches by weight and a much greater percentage by count as they are the smallest of the commercially important Pacific salmon [1,2]. Since the late 1980s, more than a billion pink salmon are released annually from hatcheries [1] to maintain the abundance of this fishery. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The native range of pink salmon encompasses parts of the southern Arctic Ocean between North America and Asia as well as much of the northern Pacific Ocean [3]. Recently, Arctic climate warming has opened previously inaccessible Arctic territory to pink salmon as well [4][5][6]. Pink salmon have been introduced to the Great Lakes in North America [7] and drainage basins of the White Sea (reviewed in [8]) near the border of Russia and Finland.
Pink salmon spend a year and a half at sea before returning to rivers to spawn at two-years of age. This near-universal two-year life history, unique to this species among salmon, has wide-ranging implications for their evolution, conservation, and possibly for their future as a species. Gene flow between year-classes/lineages is limited [9] (this phenomenon is known as allochronic or temporal isolation). There are, although very rare, exceptions that have been noted to a two-year life-cycle of pink salmon in their native range (i.e., only a few individuals have ever been reported in the literature [10][11][12]). Outside their native range, three-year-old pink salmon have been observed in the Great Lakes following introduction [7,13]. One hypothesis to explain the development of one-year-old spawning in pink salmon, based on experimental rearing in heated sea water, is that temperature may play a role in precocious development [14].
Within a year-class, population genetic differentiation among rivers tends to be lower than that of other salmon species, which is a possible consequence of increased straying of pink salmon from natal streams during spawning [15,16]. Increased straying itself may be a repercussion of the reduced time that pink salmon spend in their natal streams and the reduced time they have for imprinting on that stream compared to most other salmon species (chum salmon-Oncorhynchus keta being an exception, but chum salmon also have lower genetic diversity [3,17,18]). Pink salmon are ready for sea migration as soon as they emerge from gravel and after yolk-sac absorption [19].
In contrast to the regional reduced heterogeneity observed within year-class populations, there is a high level of divergence between year-classes as a result of limited gene flow [9,[20][21][22][23][24]. Genetic differentiation between odd and even lineages from the same river is greater than within year-class differentiation, a phenomenon observed across the species natural range [25]. There are also phenotypic differences that have been reported between lineages such as gill raker counts [21], length/size (with even-year fish tending to be smaller in Canada) [26][27][28], and survival/alevin growth in low-temperature environments [29].
The divergence of pink salmon from other Pacific salmon species has been estimated to have occurred several million years ago [30][31][32][33][34]; this provides a maximal time of odd and even lineage divergence. Based on mitochondrial nucleotide diversity, divergence times between odd and even-year lineages have previously been estimated as 23,600 years [35], 150-608 thousand years ago [36], and 0.9-1.1 million years ago [24]. The relatively recent estimates of divergence are inconsistent with complete temporal isolation between odd and even lineages (potentially for several million years). It has been suggested that low-level gene flow or recolonization of extirpated year-classes by alternate year-classes could account for recent estimates of divergence, with recolonization being a favoured explanation [35]. Both low-level gene flow and recolonization (where an even-year population was established from an odd-year population) have been observed in introduced pink salmon in the North American Great Lakes [7,37,38], revealing that it is possible that environment and temperature (suggested in [38,39]) can alter the allochronic isolation observed in modern times.
While odd-year and even-year pink salmon populations may occupy the same environment (during different years), these lineages can still have different selective pressures [40]. For example, the density of pink salmon is known to vary between years [40,41], and density may influence the composition of pink salmon predators, prey, and the number of fish on the spawning grounds [42][43][44]. In years with a high abundance of pink salmon, some studies have reported a decrease in body size of pink salmon at sea (other species of salmon and seabirds have also been adversely influenced during these high abundance years) [43][44][45][46][47]. These studies reveal that the intraspecific competition among other pink salmon and interspecific competition among other species can vary significantly between odd and even-years.
In this study, we present genome assemblies for both odd-year and even-year lineages, develop a transcriptome to help in the annotation of these assemblies, and analyze polymorphisms found between groups. We were able to identify large Fst peaks adjacent to many centromeres and to verify one major fusion or deletion on LG15_El12.1-15.1 by combining polymorphism data with long-read sequencing of both year-classes. We also identified regions of the genome that have diverged between odd and even-year lineages possibly as a response to selection. These regions of the genome are important aspects of pink salmon biology and provide greater insight into the evolutionary divergence of the lineages.

Animal care
Fisheries and Oceans Canada Pacific Region Animal Care Committee (Ex. 7.1) was the authorizing body for animal care carried out in this study. All salmon were reared, collected, or euthanized in compliance with the Canadian Council on Animal Care Guidelines.

Genome assemblies
Two genome assemblies were produced for this study. The first assembly was generated from an odd-year male and was followed by a even-year male assembly. The differences in methodology between assemblies reflect the availability of resources at the time they were generated. This is why different genomes were used for synteny and why Hi-C data was only available for the even-year assembly.
A mature male pink salmon was sampled from the Big Qualicum River Hatchery (NCBI BioSample: SAMN16688056) on September 19, 2019 (odd-year) by hatchery personnel and euthanized by concussion as specified in section 5.5 of the Canadian Council on Animal Care guidelines. A mature male pink salmon was also sampled from the Quinsam River Hatchery (NCBI BioSample: SAMN18987060) by hatchery personnel in the same manner on July 28, 2020 (even-year). We dissected liver, spleen, kidney, and heart tissues from the carcasses and flash-froze them on dry ice immediately. These tissues were stored at -80˚C. We used a Nanobind Tissue Big DNA Kit (Circulomics) to isolate high-molecular DNA following the manufacturer's protocol from multiple tissues. In addition, Short Read Eliminator Kits (Circulomics) were used to reduce the fraction of small DNA fragments in the DNA extractions following the kit protocol for DNA samples to be sequenced on Oxford Nanopore Technologies (ONT) platforms.
We generated sequencing libraries with the prepared DNA using a Ligation Sequencing Kit (SQK-LSK109 ONT) following the manufacturer's protocol. The libraries were sequenced on a Spot On Flow Cell MK1 R9 with a MinION (ONT, even and odd-year assemblies) or a Pro-methION (R9.4.1 flow cell, even-year assembly only). Libraries sequenced on the PromethION were size selected using magnetic beads (0.4:1 ratio). DNase flushes were performed to increase yield according to the manufacturer's instructions. We also tried to add 1% DMSO immediately before sequencing to reduce secondary structures that might block pores and reduce sequencing efficiency for one flow-cell (with a minor increase in pore occupancy, more titration will be needed to identify if there are benefits of adding DMSO). FASTQ sequence files were generated either using the Guppy Basecalling Software (version 3.4.3+f4fc735 for sequences from the MinION) with default settings or MinKNOW v3.4.6 (for sequences from the PromethION).
Short-read sequence data were generated for genome polishing for the even-year genome assembly (NCBI SRA accession: SRX10913279 -SRX10913282) and the odd-year genome assembly (NCBI SRA accession: SRX6595859 -SRX6595860). We generated the short-read data for the even-year genome by shearing 1ug of DNA (pink even-year male described above) with a COVARIS LE220 (Covaris) using the following configuration in a 96 microTUBE plate (Covaris): duty 20, pip450, cycles/burst 200, total time 90s, pulse spin in between 45s treatment. The library was then constructed using the MGIEasy PCR-Free DNA Library Set (MGI) following the manufacturer's protocol. The library was then sequenced on an MGISEQ-200RS Sequencer (150 + 175 PE).
We generated the short-read sequence data for polishing the odd-year genome assembly for a previous assembly that was not published because the contiguity of the assembly was low. The sequences were from an odd-year haploid female produced at Fisheries and Oceans Canada using source material from the Quinsam River Hatchery (NCBI BioSample: SAMN12367892). To produce the haploid salmon, we applied UV irradiation (560 uW/cm 2 for 176 s) to sperm from a Quinsam River male pink salmon (to destroy parental DNA) immediately before fertilizing eggs from a Quinsam River female pink salmon. Prior to sequencing, the individual was confirmed to be haploid using a panel of 11 microsatellites. The details of the library preparations and sequencing technology can be found on the NCBI website (NCBI SRA accession: SRX6595859 -SRX6595860).
We created a Hi-C library for the even-year genome assembly using the Arima-HiC 2.0 kit (Arima Genomics-manufacturer's protocol) with liver tissue from the even-year male (NCBI SRA accession: SRR14496776). The library was then sequenced on an Illumina HiSeq X (PE150). A Hi-C library was only successfully generated for the even-year genome assembly.
After sequencing, we produced initial genome assemblies with the Flye genome assembler (version 2.7-b1587 -odd, 2.8.2-b1695 -even) [ After the genome assemblies were polished, we identified the order and orientation of contigs/scaffolds on pseudomolecules/chromosomes for the odd-year genome using a previously published genetic map [56] and synteny to the coho salmon genome (NCBI: GCF_002021735.2). Chromonomer [57] (version 1.10) was used to order the contigs/scaffolds using the genetic map (parameters:-disable_splitting). Ragtag [58] (version 1.0.1) was used to order the contigs/scaffolds using synteny to the coho salmon genome (default parameters). We used a custom script [59] to compare the contig order files output by Chromonomer and Ragtag (.agp files) and manually reviewed the output for discrepancies. The manually curated order and.fasta files were submitted to the NCBI.
To order and orient contigs and scaffolds on pseudomolecules for the even-year genome, we mapped Hi-C reads to the polished assembly using scripts from Arima Genomics [60]. The output alignment file was then converted to a.bed file using BEDtool bamtobed (version 2.27.1) [61] with default parameters and sorted using the Unix command 'sort -k 4.' After the Hi-C reads were mapped to the genome assembly, Salsa2 [62, 63] was used to further scaffold the contigs and initial scaffolds (parameters: -e GATCGATC,GANTGATC,GANTANTC,GAT-CANTC). After scaffolding, we mapped the remaining contigs and scaffolds onto pseudomolecules/chromosomes using the same strategy as for the odd-year genome assembly (see above) except a newer genetic map was used [64] (an odd-year genetic map was the only available) and the rainbow trout genome assembly (NCBI: GCF_013265735.2, [65]) was chosen for synteny. The proposed order and orientation was then reviewed manually using Juicebox (version 1.11.08) [66] before submission to the NCBI. The.hic and.assembly files used by Juicebox were produced using the pipeline from Phase Genomics [67]. The nomenclature for the chromosomes was based on the linkage group from the genetic maps and from the Northern pike orthologous chromosomes in an attempt to standardize nomenclature across salmonids [68].
A BUSCO (Benchmarking Universal Single-Copy Orthologs) version 3.0.2 analysis [69] was used to assess assembly quality. We performed these analyses after polishing assemblies, but before mapping contigs/scaffolds onto chromosomes. The lineage dataset used in this analysis was actinopterygii_odb9 (4584 BUSCOs). The parameters used were: -m genome and -sp zebrafish.
A Circos plot was generated from the odd-year genome assembly using Circos software version 0.69-8 [70]. We identified homeologous regions of the genome with SyMap version 5.0. 6 [71] using a repeat-masked version of the assembly without unplaced scaffolds or contigs (default settings). Repeats had previously been identified by NCBI and were masked by us using Unix commands. The output from SyMap was formatted and summarized using scripts from Christensen et al.
We extracted DNA from tissues stored either in 100% ethanol or RNAlater (ThermoFisher) using the manufacturer's protocol [74]. Whole-genome sequencing libraries were produced at McGill University and Génome Québec Innovation Centre (now the Centre d'expertise et de services Génome Québec). The libraries were generated using the NxSeq AmpFREE Low DNA Library Kit and NxSeq Adaptors (Lucigen). They were then sequenced on an Illumina HiSeq X (PE150).
We identified nucleotide variants using GATK [75-77] (version 3.8). Unfiltered paired-end reads were aligned to the Racon corrected odd-year genome assembly (as other versions were unavailable at the time-available at: https://doi.org/10.6084/m9.figshare.14963721.v1) using bwa mem (parameters: -m) and the sort command from Samtools. Picard's [78] (version 2.18.9) AddOrReplaceReadGroups was used to change read group information (with stringency set to lenient). Samtools was used to index the resulting alignment files, and the MarkDuplicates command from Picard was used to mark possible PCR duplicates (lenient validation stringency). The MarkDuplicates command was also used to merge.bam files if multiple sequencing lanes were used to sequence the sample. Read group information was changed using the Picard command ReplaceSamHeader for these samples so that the library and sample ID were the same, but other information was not altered. This was performed so that GATK would treat the sample appropriately.

Transcriptome
To better facilitate annotation of the genome assemblies by the NCBI, we collected RNA-seq data from 19 tissues sampled from a juvenile female pink salmon (NCBI Accessions: SRX6595821-SRX6595839). Euthanasia of this salmon was performed by placing the salmon in a bath of 100 mg/L tricaine methanesulfonate buffered with 200 mg/L sodium bicarbonate. Team dissection was used to quickly remove tissues, and each tissue was stored in RNAlater Stabilization Solution (ThermoFisher) as recommended by the manufacturer.
We extracted RNA from the tissue stored in RNAlater Stabilization Solution using the Qiagen RNeasy kit (QIAGEN). Stranded mRNASeq libraries were generated at McGill University and Génome Québec Innovation Centre, with NEBNext dual index adapters. Libraries were then sequenced as a 1/39 fraction of a NovaSeq 6000 S4 PE150 lane at McGill University and Génome Québec Innovation Centre. These datasets were deposited to NCBI for use in their gene annotation pipeline (BioProject: PRJNA556728). No other analyses or transcriptome assemblies were performed on this dataset.

Population structure
As clustering techniques are sensitive to linkage disequilibrium, we used variants that were hard-filtered (including for allele balance) and filtered for linkage disequilibrium (see Wholegenome re-sequencing section for filtering details) for all population structure analyses. A DAPC analysis [83] was used to cluster individuals in R [84] using the following packages: adegenet [85], vcfR [86], and ggplot2 [87]. The number of DAPC clusters was determined using the find.clusters function and choosing the cluster count with the lowest Bayesian information criterion. Thirty principal components were retained with the dapc function. The variants used for the DAPC analysis were not yet mapped to chromosomes.
To complement the DAPC analysis, we also performed an Admixture (version 1.3.0) analysis [88] to identify clusters of individuals and quantify the admixture between the identified groups. To format the linkage disequilibrium thinned.vcf file, we used a custom Python script to rename the chromosomes to numbers [79] and PLINK (version 1.90b6.15) [89, 90] was used to generate.bed files (parameters:-chr-set 26 no-xy,-double-id). PLINK was also used to generate a principal components analysis. The admixture software was then used to identify the optimal cluster number based on the lowest cross-validation error value. The admixture values from this analysis were plotted in R.
To examine population structure based on the mitochondrion sequence, we generated a phylogenetic tree based on full mitochondria sequences. The genome assembly included a mitochondrion sequence, and this region of the genome was subset from the variant file using vcftools. The resulting file and the SNPRelate [91] package in R were used to generate the phylogenetic tree. The snpgdsVCF2GDS and snpgdsOpen functions were used to import the data, the snpgdsDiss function was used to calculate the individual dissimilarities for pairwise comparisons between samples, the snpgdsHCluster function was used to generate a hierarchical cluster of the dissimilarity matrix, the snpgdsCutTree function was used to determine subgroups, and the snpgdsDrawTree function was used to plot the dendrogram.
From the variants with minimal filtering and the variants after all filters had been applied, the heterozygosity ratio was separately calculated based on the number of heterozygous genotypes divided by the number of alternative homozygous genotypes [92,93]. The number of heterozygous and homozygous genotypes were counted using a python script from Christensen et al. (2020) [79]. Heterozygous genotypes per kilobase pair (kbp) was calculated by dividing the heterozygous genotype counts by the genome size (2,528,518,120 bp) and then multiplied by 1000. This calculation was used on the variants with minimal filtering not yet mapped to chromosomes.
The number of shared alleles was calculated as a metric for relatedness using custom scripts for the variants with minimal filtering and which were mapped to chromosomes [94]. This value is calculated by counting the number of alleles an individual has in common with another individual and is similar to previous work [95-97]. The percent shared alleles was calculated in R (number of shared alleles divided by the total allele count multiplied by 100) and plotted using the reshape2 [98] and pheatmap [99] R packages.
Fst, nucleotide diversity (within populations-pi and between-dxy), and Tajima's D were calculated and plotted using the R packages PopGenome [100], dplyr [101], tidyr [102], stringr, and qqman. In PopGenome, all metrics were calculated using a sliding window of 10 kbp and the data were visualized as a Manhattan plot using qqman. A 10 kbp window was chosen to minimize the influence of individual variants while maintaining fine-scale resolution to identify regions of the genome that have interesting profiles. We used the populations module from Stacks version 2.54 [103] to calculate the number of private alleles, percent of polymorphic variants, Fis (inbreeding coefficient), and Pi (nucleotide diversity within a population) for odd and even year class samples grouped as populations. A comparison was also performed to see how filtering influenced these metrics.

Genomic regions associated with population structure under selection
To identify regions of the genome associated with population structure identified in the DAPC analysis and potentially under selection, we performed an eigenGWAS analysis [104]. The format of the hard-filtered variants was converted to the appropriate format in PLINK, and the GEAR [105] software was used to run the eigenGWAS analysis (this was performed on a slightly different version of the genome assembly than the one available on the NCBI website, but only positions on chromosome 9 were minimally affected). Significance was corrected for using the genomic inflation factor to better identify markers potentially under selection rather than a result of genetic drift between populations. The genomic inflation factor corrected p-values were then plotted in R using the qqman [106] and stringr [107] packages. A Bonferroni correction was applied as a multiple test correction (alpha = 0.05). Only peaks with at least 5 SNPs within 100 kbp of each other were retained to reduce false-positives (nucleotide variants under selection are expected to be in linkage disequilibrium with surrounding variants and significant single variants not in linkage may be a consequence of spurious alignments). Average linkage disequilibrium declines rapidly after 100 kbp in cultivated coho salmon [108], and likely after shorter distances in wild populations. Multiple factors such as population dynamics (e.g., small population size), multiple associations in one region, or selection could explain linkage over greater genomic distances. At the time of writing, the genome assembly has not been annotated by NCBI, and synteny was used to identify candidate genes by using BLAST [109,110] to align variants with the lowest p-value to other annotated salmon genomes (coho salmon: GCF_002021735.2, sockeye salmon: GCF_006149115.1, Chinook salmon: GCF_018296145.1). Nucleotide diversity (pi) and other metrics were calculated using Stacks for these regions. Tajima's D values for these regions were generated in PopGenome.

Sex determination and sdY
We utilized a genome-wide association (GWA) of phenotypic sex to identify the region of the genome associated with sex for all pink salmon (individual year-classes were checked as well). This analysis was also used to identify where the contig from the genome assembly with the sdY gene should be placed. This was confirmed with synteny from the rainbow trout Y-chromosome (NC_048593.1) and manual inspection of the Hi-C data (it was placed in the evenyear genome assembly). The GWA analyses were performed using PLINK (parameters:logistic-perm). Synteny was identified from alignments to the rainbow trout genome assembly (GCF_013265735.2, [65]) using CHROMEISTER [111] (default settings).
When manually genotyping the presence/absence of the sdY gene by visualizing alignments in IGV [112], we noticed some males had increased coverage of the sdY gene, and two haplotypes were identified (4 variants in non-coding DNA). The haplotypes were manually genotyped (either as the CGGA or TTAC haplotype). To estimate the copy number of the sdY gene, we first used a python script to determine the average coverage of all hard-filtered variants [113]. The average coverage of the four variants in the sdY gene was then divided by the average coverage of all variants.

Genome assemblies
The odd-year assembly (GCA_017355495.1) had a combined length of~2.5 Gbp, with 20,664 contigs and a contig N50 of~1.8 Mbp. The even-year assembly had similar metrics, with a contig N50 of~1.5 Mbp, 24,235 contigs, and a length of~2.7 Gbp. We used a BUSCO analysis of known conserved genes to determine the completeness and quality of the genome assembly. Of the 4584 BUSCOs, 95.3% were found to be complete in the odd-year genome assembly (54.9% single-copy and 40.4% duplicated), 1.4% were fragmented, and 3.3% were missing. The even-year assembly also had 95.3% complete BUSCOs (51.5% single-copy and 43.8% duplicated), but more fragmented (1.6%) and fewer missing BUSCOs (3.1%).
The odd-year and even-year assemblies had 26 linkage groups and extensive homeologous regions between chromosomes (Fig 1, the even-year assembly is very similar to the odd-year assembly S1 Fig). The odd-year genome assembly contained similar levels of repetitive DNA and duplicated regions compared to other salmonids (Fig 1, [72 , 79, 114]). Like other salmon species, increased sequence similarity was also observed at telomeres between duplicated chromosomal arms (Fig 1). Peaks of increased Fst between odd and even-year lineages were commonly found at putative centromere locations (Fig 1, Table 1).

Population structure
A shared allele analysis (Fig 2) and both admixture and DAPC analyses (Fig 3) revealed a clear delineation between odd and even-year lineages. Parent-progeny and sibling relationships (relationships known during sampling) are highlighted by increased levels of shared alleles, but the majority of clustering appears to be related to geographical distance (Fig 2, S1 File). No apparent admixture was observed in the even-year class ( Fig 3B). In the odd-year lineage, estimated ancestry from the even-year group varied from zero to over forty percent ( Fig  3B). Odd-year ancestry ranged from 0.75-0.77 in Kitimat, 0.76-0.78 in Atnarko, and 0.92-1 in Quinsam salmon (Fig 3B).
A separate analysis of mitochondrial DNA was performed to further investigate the relationships between the odd and even-year lineages. Odd-year pink salmon had longer branch lengths in mitochondria dendrograms and haplotype networks with more uniform distributions of haplotypes (Fig 4A and 4B). The even-year salmon had two major haplotypes (Fig 4B). Mitochondrial sequence analyses revealed 21 unique mitochondria haplotypes among the 61 individuals with 1-19 steps between haplotypes (Fig 4). Based on the length of the sequence analyzed (16,822 bp) this represents a mutation frequency between 0.006% to 0.1%. One haplotype was shared between lineages and the closest haplotype that was not shared had 5 steps between year-classes (Fig 4). The mitochondrial analyses illustrate divergence between the odd and even-year lineages, but also raises questions regarding possible recent admixture based on a shared haplotype and an odd-year haplotype most closely related to an even-year haplotype.
Several metrics were calculated to quantify genetic divergence between and within yearclasses: heterozygosity ratios, heterozygous genotype per kbp, polymorphic sites, private alleles, and nucleotide diversity. Heterozygosity ratios in odd-year fish ranged from 1.5-4.56, with an average of 2.54 (excluding haploid individuals generated for a previous project) (S1 File). Even-year class individuals ranged from 1.09-1.78, with an average of 1.44 (S1 File). The average heterozygous genotype per kbp (excluding haploids) was 0.71 for odd-year salmon (range: 0.55-0.85) and 0.58 for even-year (range: 0.45-0.69) pink salmon. The Pearson correlation between heterozygosity ratio and heterozygosity per kb was 0.91 (excluding haploids). Salmon from odd-years had on average higher levels of polymorphic sites, increased private alleles, and increased nucleotide diversity ( Table 2). These values varied based on parameters used for filtering nucleotide variants ( Table 2). The average percent of shared alleles among odd-year fish was 76.13%, 74.42% among even-year individuals, and 71.04% between yearclasses (S1 File). Most analyses revealed increased genetic diversity among odd-year pink salmon than among even-year pink salmon and fewer shared alleles between odd and evenyear populations than within year-class.

Genomic regions associated with odd and even-year lineages
We identified regions of the genome with divergence between odd and even-year lineages using an eigenGWAS and Fst analysis (see Fig 1 and Table 1 for Fst analysis). Seventeen significant regions of the genome were discovered with the eigenGWAS analysis that contribute to the divergence between odd and even-year lineages ( Fig 5, Table 3, S1 Table). These regions are putatively under selection as genetic drift is partially accounted for through the genomic inflation factor. Multiple candidate genes under selection were identified in these regions ( Table 3, S1 Table). Nucleotide diversity, observed heterozygosity, and Tajima's D values for these regions are given in S1 Table. In addition to identifying divergent regions of the genome possibly under selection, we also identified Fst peaks between lineages. Seven of the eight largest Fst peaks between year-classes were located in the vicinity of a centromere (Fig 1, Table 1). More detail is presented on one of the largest Fst peaks. This peak is also associated with a large deletion or fusion. The Fst peak A heatmap of shared alleles between salmon is shown with clustering and a dendrogram. Each square represents the percent shared alleles after minor filtering of variants (bi-allelic SNPs). In addition to the legend displaying the colour representation of percent shared alleles, the sex, year-class, and river system sample information is colour-coded and shown on both rows and columns.  Table 4) is in Hardy-Weinberg equilibrium in the odd-year lineage (p = 0.984 with a chi-square test), but fixed in the even-year lineage (Fig 6B and 6C). When Oxford Nanopore reads from the two year-classes were aligned to the genome assembly, a heterozygous deletion or fusion from 51,670,144-52,926,328 was found in this region of the odd-year salmon sequences (S2 Fig). The~1.2 Mbp deletion/fusion may explain why the LG15_El12.1-15.1 Fst peak was one of the largest and widest (Fig 1, S2 Fig).

Sex determination and sdY
The sex-determination gene in salmonids, sdY [116], was located on a~110 kbp contig in the pink salmon odd-year genome assembly (NCBI accession: JADWMN010014055.1) and on a contig~367 kbp that was placed onto a chromosome in the even-year genome assembly. The sdY gene can be placed at one of the ends of LG20_El14.2 by using genome-wide association with sex as the trait of interest, Hi-C contact data (even-year genome), and synteny with the rainbow trout Y-chromosome or chromosome 29 (an autosome) of the coho salmon (Fig 7A,  S3 and S4 Figs).
LG20_El14.2, has the reverse orientation in the odd-year assembly compared to the genetic map (Fig 1), but was corrected to have the same orientation in the even-year assembly (S1 Fig). In both genome assemblies there is only one copy of the sdY gene, confirmed with a BLAST alignment of a sdY gene available in the NCBI database (KU556848.1) to the respective assemblies. From a self-alignment of the sdY-containing contig, the majority of this contig is highly repetitive, > 90 kbp out of~110 kbp. From the alignment of the sdY-containing region in pink salmon to the coho salmon chromosome 29, only a small portion of the Y-chromosome appears to be unique to the Y-chromosome (S4 Fig). Genotypes were called for the majority of this region for males and females, and the main difference related to sex was that all females had large runs of homozygosity while many males had large runs of heterozygosity (S5 Fig,  S1 File).   From previous research [117,118], a pseudo growth hormone 2 gene was shown to be tightly linked to sex-determination in pink salmon. Four tandem duplicates of this gene (NCBI: DQ460711.1) were identified on the same contig in the even-year genome assembly,  LG01_El19. LG24_El10.1-25.1 47355926-47430898 47398112 microtubule-associated protein 9 map9 LG25_El23

PLOS ONE
but only two copies were found in the odd-year genome assembly on separate contigs (S1 File). As these contigs were not mapped to a chromosomal position, it is likely that parts of the Y-chromosome specific region remain incomplete in these two assemblies. There were two sdY haplotypes (variants found in non-coding DNA) observed in both odd and even male pink salmon (Fig 7B, S1 File). Additionally, some males possessed multiple copies of the sdY gene (10/25 or 40%, assuming that 1.5x coverage or greater was due to a second copy) (Fig 7B). Although both salmon used for sequencing the genomes appeared to have a single copy of the sdY gene (or the sequences were collapsed during assembly), males from Atnarko, Deena, Quinsam, and Yakoun had multiple copies of sdY. While most males had 1-2 copies of the sdY gene, one Quinsam male appeared to have four copies. The CGGA sdY  haplotype (see Materials and Methods section) was only identified in a single odd-year male pink salmon, while the TTAC haplotype was evenly distributed between year-classes and was the only haplotype with multiple copies (S1 File). Based on manual inspection of the genotypes, long stretches of heterozygosity were observed near the sdY gene in some males, but not in others. In males with the TTAC sdY haplotype, there were extended or short runs of heterozygosity evenly distributed between yearclasses (S1 File). Even-year males with the TTAC sdY haplotype and a short run of heterozygosity were more likely to have multiple copies of the sdY gene (n = 4, average = 2.7) than the same group with long runs of heterozygosity (n = 4, average = 0.9, p = 0.017 with one-tailed, unpaired t-test). Any individuals with the CGGA sdY haplotype did not have stretches of heterozygosity near the putative location of sdY. One hypothesis to explain these results is that individuals with the CGGA sdY haplotype have an alternative sex chromosome.

Population structure
Similar to previous studies [25, 56], pink salmon population structure divergence was found to be greater between year-classes rather than based on geography at the whole genome level. Shared allele, DAPC, and admixture analyses point to a clear delineation of odd and even lineages, with the exception of the only sample from Japan. In British Columbia, the even-year lineage appeared to be more homogeneous than the odd-year lineage based on the admixture analysis and several population metrics such as nucleotide diversity. In a species-wide range context, these results exhibit the same trend of a major divergence between odd and even-year lineages previously observed in other studies (with minor geographic population structure within a lineage) [15,25,56].
Divergence between lineages was also revealed by whole mitochondrial sequences. There were 21 unique mitochondria genotypes among the 61 individuals sampled, and only one of these haplotypes was shared between lineages. While the number of unique haplotypes was the same between lineages, most of the even-year class haplotypes (8 out of 10) were similar in sequence. The two major haplotypes seen in the even-year class were consistent with the Alaskan A and AA haplotypes seen in Churikov and Gharrett (2002) [35], as were the numerous and more distantly related odd-year haplotypes.
The low nucleotide diversity of mitochondrial haplotype networks and the increase of rare haplotypes have led previous studies to conclude that pink salmon (with some local exceptions) have undergone a bottleneck during the Pleistocene interglacial period and rapid expansion since the last glacial maximum or earlier [35,36,119]. The interconnected mitochondrial networks in these studies have inner shared haplotypes between year-classes. Churikov and Gharrett (2002) suggested that these observations supported a model where a year-class might go extinct and an alternate year-class would then replace that population rather than continued gene flow between year-classes that would be necessary to otherwise explain the shared haplotypes (incomplete lineage sorting was tested) [35]. The mitochondrial network seen in this study is consistent with that hypothesis. An alternative hypothesis is that environmental factors influence maturation timing and the two-year life-cycle of pink salmon, and gene flow between year-classes only occurs when environmental conditions favour changes to the two year life-cycle, as that seen in the introduction of pink salmon to the Great Lakes [7,37,38]. Estimates of divergence based on mitochondrial sequences suggest that odd and even-year lineages (from East Asia and Alaska) are relatively recent for pink salmon as a species (generally less than 1 million years ago) and divergence may have began during the Pleistocene interglacial period or later [24, 35, 36].
It has previously been reported that the odd-year lineage of pink salmon has higher levels of heterozygosity, private alleles, and allelic richness [25,56]. A similar trend was observed in this study with the heterozygosity ratio, heterozygous genotypes per kbp, private alleles, and other metrics assessing nucleotide diversity. Several factors could help explain the reduced levels of nucleotide diversity seen in the sampled even-year populations. Tarpey et al. (2018) suggested three possibilities and these possibilities also apply to our current results, 1) the odd-year lineage was older and the even lineage was derived from the odd-year lineage, 2) there was a past reduction in even-year lineage(s), and 3) genetic variation was lost during adaptation [25]. Further sampling will be required to understand if this phenomenon is seen in all even-year populations (especially as lower heterozygosity in the even-year lineage is not universally supported, e.g., [22]). This information is important to interpret which hypothesis is better supported or if another model is better suited (e.g., extirpated lineage replaced by alternate year-class).

Genomic regions putatively under selection
A large component of the genetic and phenotypic diversity between pink salmon year-classes likely originates from genetic drift as there is little evidence for gene flow between lineages. However, in addition to genetic drift, these lineages may experience different selective pressures even if they occupy the same streams. As mentioned in the Introduction, population density between lineages is often different and this can generate different ecological environments. EigenGWAS (to identify regions potentially under selection-this section) and Fst analyses (to identify major regions of the genome that have diverged between lineages-next section) were used to identify regions of the genome potentially responsive to these environmental differences between pink salmon year-classes (17 regions in the eigenGWAS analysis and eight regions in the Fst analysis). Candidate genes under selection were organized into three broad categories (immune system, organ development/maintenance, and behaviour), and each is discussed below.
Several factors could influence why these immune related genes might be under selection between odd-year and even-year populations of pink salmon. For example, altered migration patterns (reviewed in [141,142]), increased pathogen loads between year classes due to increased density (reviewed in [142,143]), and increased physiological stress from competition and increased number of predators during years with larger returns (e.g., [142]) could all influence the differences observed in immune related genes. Further investigations into the nature of these genes in pink salmon may uncover the environmental factors and selective pressures relevant to the evolutionary history of these pink salmon lineages. Preliminary metrics (i.e., observed heterozygosity, Tajima's D, and manually annotated genotype haploblocks) suggest that there have been recent selective sweeps in the even-year populations for all of these genes, while only two genes appeared to experience the same sweeps in the odd-year populations (for the opposite haploblocks), arid3a and H2-Aa.
Organ development/maintenance. Salmon go through nutritional and behavioural changes that require organ-level alterations and maintenance throughout their life-cycle. This can be observed in developing salmon that transition from planktivorous to piscivorous diets. In the eye, this transition requires the development of new functionality such as night vision to chase prey. One example of such a transition is the change of opsins in Pacific salmon during maturation, from UV opsins in hatched salmon to blue opsins in later life stages [144,145].
Variation in vision related genes have previously been observed between sockeye salmon populations [79]. In Atlantic salmon, six6, a gene related to eye development, daylight vision [146,147], and fertility [148] was also found to be associated with age at maturity [149,150] and later with stomach fullness during migration [151]. These studies suggest that genetic variation influencing organ development, transition, or maintenance are important components influencing salmonid evolution. Similar to six6 in Atlantic salmon, Protein tyrosine phosphatase receptor type J (PTPRJ) [152], histidine N-acetlytransferase (hisat) [153][154][155][156][157][158][159][160], and microtubule-associated protein 9 (MAP9 or ASAP) [161] all appear to play roles in proper vision. The variation in these genes may represent differences in selective pressure between odd and even-years and could be driven by the different population dynamics observed between odd and even-year populations. Preliminary metrics suggest odd-year populations have had recent selective sweeps of all three of these genes, and even-year populations have had a recent selective sweep for the opposite haploblock of the hisat gene.
Cystathionine gamma-lyase (CTH) may have, among other roles, a function in hearing [162][163][164], and could have been influenced by similar population dynamics as those suggested for vision-related genes. Multidrug and toxin extrusion protein 2 gene (SLC47A2) is not related to a specific organ, though it may have a special role in the blood-brain barrier [165,166]. Instead, it may help in removing toxins [166], which might accumulate in more dense populations. For example, dense spawning populations of salmon have been shown to drastically decrease dissolved oxygen in a stream [167] and increase ammonium and other toxin levels (reviewed in [168]). Evidence for a selective sweep of SLC47A2 was observed in preliminary metrics of odd-year populations.
Behaviour. Fish display consistent behavioural differences from each other, analogous to human personalities [169]. Personality variation in a population may represent adaptive solutions to different environmental pressures [169]. In high density populations, such as the oddyear populations, more aggressive behaviours during high-density spawning conditions [42] could result in more offspring, but might waste energy in lower-density conditions. Associations to genes related to behaviour have previously been identified among sockeye salmon populations [79], and under selection between wild and farmed Atlantic salmon [170]. In the present study, protein-methionine sulfoxide oxidase mical2b (mical2b) [171,172] and cell division control protein 42 homolog (CDC42) [173], both putative genes found in the eigen-GWAS analysis between even and odd-year pink salmon, have previously been found to be associated with anxiety/reactiveness and schizophrenia, respectively. Preliminary evidence of a recent selective sweep was identified in even-year populations for the mical2b gene and in the odd-year populations for the CDC42 gene.

Fst peaks between odd and even-year lineages
A single major chromosomal polymorphism (either a fusion or deletion) was identified proximal to a centromere on LG15_El12.1-15.1. This region was characterized by~4 Mbp runs of homozygosity/heterozygosity. This region was identified from an Fst analysis because nearly the entire region was fixed in the even-year lineage, but appeared to segregate as a single locus in Hardy-Weinberg equilibrium in the odd-year lineage. It is difficult to distinguish between a deletion and a chromosomal fusion in these analyses. Previous research supports chromosomal variants in pink salmon [174] and a species-specific fusion of this chromosome [68], but further research will be needed to test this hypothesis.
Interestingly, runs of homozygosity/heterozygosity were common at centromeres rather than an effect of chromosomal polymorphisms (all but two of the 26 pink salmon chromosomes are metacentric-the other two are subtelocentric [175]). Six other major runs of homozygosity/heterozygosity were also located near centromeres and they differed between lineages. All of these Fst peaks extend for at least 1 Mbp and were in Hardy-Weinberg equilibrium. The only other Fst peak, besides the one on LG15_El08.2-20.1 that was fixed in a population, was the peak on LG21_El24.2-22.1. All of the other Fst peaks were skewed toward opposite genotypes, with the exception of LG26_El09.1-11.1, which varied by the fraction of homozygous and heterozygous genotypes between odd-year and even-year populations. It is expected that regions with reduced recombination, such as centromeres, will have increased runs of homozygosity and reduced genetic diversity (reviewed in [176]). This may help explain why there are long runs of homozygosity at centromeres, but not why there are differences between lineages at these loci. Genetic drift or selection such as centromere drive (a form of meiotic drive thought to occur during female meiosis) would also need to be considered.
The centromere drive hypothesis posits that a centromere can be retained in a female gamete (i.e., retained in the oocyte rather than the polar body) more often than an alternative centromere during meiosis due to an advantageous DNA sequence mutation at the centromere or from mutations in centromere associated proteins (reviewed in [177][178][179]). In populations that become isolated, the competition between centromere sequences can quickly drive differentiation at these regions between the populations and result in hybrid defects should they come into contact again [178]. These observations reveal that the pink salmon lineages may be at a point where speciation is a likely outcome as these large centromere differences could cause hybrid defects. For example, in medaka, genomic diversity at non-acrocentric repeats in centromeres were associated with speciation [180].
The centromere drive hypothesis may further shed light on the fixation of the Fst peak on LG15_ El12.1-15.1. Robertsonian fusions (assuming that the Fst peak on LG15_El12.1-15.1 is indeed associated with a fusion rather than a deletion) can generate centromeres that are preferentially able to segregate to the egg during female meiosis [179]. This could help drive the fusion to fixation in a population. Alternatively, if the telocentric chromosomes instead of the fused metacentric chromosome had more effective centromeres, the telocentric chromosomes would become fixed. Further studies will be needed to confirm if there is indeed a fusion instead of a deletion (e.g., fluorescence in situ hybridization) and that the fusion leads to fixation by centromere drive (e.g., studying segregation distortion in crosses between fish with and without the fusion).

Sex determination and sdY
With the discovery of a novel sex-determining gene in salmonids [116], and previously with closely linked genetic markers [181,182] researchers have been able to identify instances of sex-determination switching between chromosomes in salmonids [183][184][185][186][187]. As suggested in Yano et al. (2013), Y-chromosome switching may act in response to (expected) degeneration of the Y-chromosome due to mutation accumulation from reduced recombination [188]. In pink salmon, sdY was located on LG20_El14.2, but we suggest there may be an alternative location as well. Several pieces of information indicate that LG20_El14.2 may not be the only location of the sex-determining gene, sdY, in the pink salmon genome. For instance, there were two sdY haplotypes and several males had multiple copies of this gene. Also, all males with the CGGA sdY haplotype had a run of homozygosity similar to most females on the LG20_El14.2 chromosome near the putative location of sdY. We identified the CGGA sdY haplotype in even-year males from Snootli Creek (2 out of 3 males), Kitimat River (3 out of 3), and the Yakoun River (2 out of 3). The haplotype was not observed in even-year males from Deena Creek (n = 3) or the Quinsam River (n = 3). The single odd-year male with the CGGA sdY haplotype was from the Kitimat River (1 out of 2). It is expected that near the sdY gene, recombination is reduced and mutations would accumulate between the X and Y-chromosomes as a result of reduced recombination. Females tend to have long runs of homozygous genotypes where recombination is reduced and males tend to have long stretches of heterozygous genotypes when reads from the X and Y-chromosome align at the same location [79]. Since the males with the CGGA sdY haplotype have long runs of homozygous genotypes at the LG20_El14.2 region, as most of the females do, we suggest that the CGGA sdY is at another location in the genome in these individuals. We were unable to identify a precise putative alternative location because there were too few individuals with the CGGA sdY to obtain a signal from a genome wide association analysis, however, the potential discovery of another salmon species with alternative sdY locations, further supports the hypothesis of Y-chromosome switching put forth by Yano et al. (2013) for salmonids [188].

Conclusions
We generated reference genome assemblies for both pink salmon lineages, RNA-seq data for genome annotation, and whole genome re-sequencing data to expand the available resources for this commercially important and evolutionarily interesting species. The coupled whole genome re-sequencing study of 61 individuals from several streams in British Columbia (and one from Japan) helped us to characterize regions of the genome that have diverged between the temporally isolated groups. The amount and degree of lineage-specific genomic variation suggests that there is little gene-flow between the year-classes, but the shared variants such as whole mitochondrial and sdY haplotypes suggests that there has been enough recent gene-flow or alternative year-class replacement to maintain these similarities. Divergence at centromeres between the two lineages may be a consequence of centromere drive (or genetic drift and reduced recombination) and represent early stages of speciation. Genes related to the immune system, organ development/maintenance, and behaviour were divergent between odd and even-year classes as well. These example lineage defining differences offer us a glimpse into the evolutionary landscape and the selective pressures or demographic histories of pink salmon.
Supporting information S1 File. Sample information. The sample tab has metadata about each sample, including information on sex, river, and year-class (latitude and longitude locations are approximate). The StatsAllFilters tab shows metrics from the.vcf file after filtering for LD (see methods). Stats1stFilter has the same information, but from the.vcf file after only preliminary filtering (see methods). The eigenGWAS tab contains the DAPC values used in the eigenGWAS analysis (see methods). The Mitochondrion tab shows metadata used to generate the mitochondria figures. The GPS tab shows the coordinates used in the sample map. The Admixture tab has the values output from the admixture analysis. For each tab with LG, these sheets have manually genotyped areas and calculations of HWE. The PrivateAlleles tab has metrics output from Stacks. The SharedAlleles tab has a matrix of shared alleles between individuals in long format and statistics on the right. The Y-Chrom tab has information about the sdY haplotypes. The GWAS tab has metadata used in the GWAS analysis. The GHp tab displays the alignments of the growth hormone pseudogene and sdY gene to the odd and even genome. LG15_El08.2-20.1 and a chromosomal polymorphism, either a deletion or evidence of a chromosomal fusion. A) LG15_El08.2-20.1 is depicted with the distance and location of the purposed polymorphism (in light translucent red). Scaffolds/contigs that comprise the region surrounding the polymorphism are shown below the chromosomal depiction, with a blue arrow showing where multiple small contigs were placed. B) Synteny with rainbow trout and Northern pike is shown based on CHROMEISTER [111] alignments. C) ONT/Nanopore reads that were used to generate the genome assemblies were aligned back to the odd-year genome and visualized with IGV. Reads in the odd-year individual are shown flanking the deletion (the display was split because the region was too large to adequately visualize continuously, ellipses mark the split). The proposed deletion is shown below the long reads. . The blue boxes represent chromosomes/pseudomolecules (the top is the proposed Y-specific region and the bottom is the rest of LG20_El14.2) and the green boxes represent scaffolds or contigs mapped to this chromosome. Red points represent contacts (close proximity) between regions. There are multiple inversions between the pink salmon and coho salmon genome seen in the dotplot, but the contact map supports the order and orientation for the pink salmon genome assembly and these could represent actual inversions between species instead of assembly errors. (TIF)

S5 Fig. Sex determining region of the even-year pink salmon with genotype information.
Genotypes are shown from an IGV [112] screenshot for the 61 samples of pink salmon for the region with the sdY sex-determining gene. The top portion shows the distance of the Y-specific genome region (~3.2 Mbp) and the contig/scaffold boundaries that make up this region are shown as vertical lines. Below the distances, allele frequencies for each locus are shown, and below that individual genotypes. The x-axis of the genotypes represent loci and each line on the y-axis represents an individual pink salmon. The dark-blue colour is a homozygous reference genotype, the light-blue colour a heterozygous genotype, and the green genotype is for a homozygous alternative locus. There are large stretches (1-2 Mbp) of heterozygosity and homozygosity based on sex. Please note that there is a possible inversion (from a mis-assembly) in this region as the runs of homozygosity and heterozygosity are broken by a section from~600 kbp and~1,300 kbp. (TIF) S1